Similar n-gram language model
Authors
Abstract
This paper describes an extension of the n-gram language model: the similar n-gram language model. The classical model of order n estimates the probability P(s) of a string s from occurrence statistics of the last n words of the string in the corpus, whereas the proposed model additionally uses all strings s′ whose Levenshtein distance to s is smaller than a given threshold. The similarity between s and each string s′ is estimated from co-occurrence statistics. The new P(s) is approximated by smoothing all the similar n-gram probabilities with a regression technique. A slight but statistically significant decrease in the word error rate is obtained on a state-of-the-art automatic speech recognition system when the similar n-gram language model is interpolated linearly with the n-gram model.
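The core idea can be sketched in a few lines. The following is a minimal illustrative toy, not the authors' implementation: it replaces the paper's co-occurrence-based similarity weighting and regression smoothing with a plain average over n-grams within the Levenshtein threshold, and then interpolates linearly with the baseline n-gram estimate. All names, parameters, and the toy corpus are assumptions.

```python
# Toy sketch of a "similar n-gram" estimate (illustrative assumptions,
# not the paper's exact method).
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two sequences (works on strings or tuples)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

class SimilarNgramLM:
    def __init__(self, tokens, n=2, threshold=1, lam=0.7):
        self.threshold, self.lam = threshold, lam
        self.counts = Counter(tuple(tokens[i:i + n])
                              for i in range(len(tokens) - n + 1))
        self.total = sum(self.counts.values())

    def p_ngram(self, ngram):
        # Plain maximum-likelihood n-gram probability (no smoothing).
        return self.counts[tuple(ngram)] / self.total

    def p_similar(self, ngram):
        # Average over all stored n-grams within the Levenshtein
        # threshold; a crude stand-in for the paper's similarity weights.
        near = [g for g in self.counts
                if levenshtein(g, tuple(ngram)) <= self.threshold]
        if not near:
            return 0.0
        return sum(self.p_ngram(g) for g in near) / len(near)

    def p(self, ngram):
        # Linear interpolation of the baseline and similarity estimates.
        return (self.lam * self.p_ngram(ngram)
                + (1 - self.lam) * self.p_similar(ngram))

lm = SimilarNgramLM("the cat sat on the mat the cat ran".split())
```

Here `lm.p(("the", "cat"))` mixes the direct bigram estimate with mass borrowed from neighbours such as `("the", "mat")`.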
Similar resources
Class-based language model adaptation using mixtures of word-class weights
This paper describes the use of a weighted mixture of class-based n-gram language models to perform topic adaptation. By using a fixed class n-gram history and variable word-given-class probabilities, we obtain large improvements in the performance of the class-based language model, giving it accuracy similar to a word n-gram model, and an associated small but statistically significant improvemen...
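The structure described above (a fixed class history combined with topic-weighted word-given-class probabilities) can be sketched as follows. This is a hedged illustration under assumed names and toy probabilities, not the paper's model or data:

```python
# Sketch of a mixture of class-based n-gram models: the class-bigram
# term is fixed, while the word-given-class term is a weighted mixture
# over topics. All dictionaries below are illustrative assumptions.

def class_mixture_prob(word, prev_class, word2class,
                       p_class_given_class, p_word_given_class_by_topic,
                       topic_weights):
    """P(word | history) ~ P(c(word) | c(prev)) * sum_t lam_t * P_t(word | c(word))."""
    c = word2class[word]
    p_class = p_class_given_class[(prev_class, c)]
    p_word = sum(lam * p_t[(word, c)]
                 for lam, p_t in zip(topic_weights,
                                     p_word_given_class_by_topic))
    return p_class * p_word
```

For example, with `P(NOUN | DET) = 0.5` and two topic models giving "model" probability 0.2 and 0.6 within NOUN, equal topic weights yield 0.5 × 0.4 = 0.2.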
A State-space Method for Language Modeling
In this paper, a new state-space method for language modeling is presented. The complexity of the model is controlled by choosing the dimension of the state instead of the smoothing and back-off methods common in n-gram modeling. The model complexity also controls the generalization ability of the model, allowing it to handle similar words in a similar manner. We compare the state-space mo...
Growing an n-gram language model
Traditionally, when building an n-gram model, we decide the span of the model history, collect the relevant statistics and estimate the model. The model can be pruned down to a smaller size by manipulating the statistics or the estimated model. This paper shows how an n-gram model can be built by adding suitable sets of n-grams to a unigram model until desired complexity is reached. Very high o...
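The growing procedure can be illustrated with a deliberately simplified sketch: start from unigram counts and greedily add the most frequent higher-order n-grams until a size budget is reached. The frequency-based selection criterion and all names here are assumptions for illustration; the paper's actual inclusion criterion is not reproduced.

```python
# Toy sketch of growing an n-gram model from a unigram base
# (illustrative greedy criterion, not the paper's method).
from collections import Counter

def grow_ngram_model(tokens, max_order=3, budget=4):
    model = Counter(tokens)  # start from unigram counts
    candidates = Counter()
    for n in range(2, max_order + 1):
        candidates.update(tuple(tokens[i:i + n])
                          for i in range(len(tokens) - n + 1))
    # Add the most frequent higher-order n-grams until the budget
    # of added entries is exhausted.
    for gram, count in candidates.most_common():
        if len(model) >= budget + len(set(tokens)):
            break
        model[gram] = count
    return model

model = grow_ngram_model("a b a b a c".split())
```

After growing, the model holds the unigrams plus the `budget` most useful (here: most frequent) bigrams and trigrams, e.g. `("a", "b")`.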
Segmenting DNA sequence into 'words' based on statistical language model
[Abstract] This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that the length of most DNA "words" is 12 to 15 bp. The bound on the language entropy of DNA sequences is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised 'probability approach...
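Segmenting a sequence with a word-level language model is typically done by dynamic programming: choose the segmentation maximizing the total log-probability of the words. A minimal sketch, using an assumed toy lexicon and a unigram model in place of the paper's n-gram model:

```python
# Toy dynamic-programming segmenter: best[i] is the best log-prob of
# any segmentation of seq[:i]. The word log-probabilities passed in
# are illustrative assumptions.
import math

def segment(seq, word_logprob, max_len=4):
    best = [0.0] + [-math.inf] * len(seq)
    back = [0] * (len(seq) + 1)
    for i in range(1, len(seq) + 1):
        for j in range(max(0, i - max_len), i):
            w = seq[j:i]
            if w in word_logprob and best[j] + word_logprob[w] > best[i]:
                best[i] = best[j] + word_logprob[w]
                back[i] = j
    # Trace back the best segmentation.
    words, i = [], len(seq)
    while i > 0:
        words.append(seq[back[i]:i])
        i = back[i]
    return words[::-1]
```

With an assumed lexicon `{"AT": -1.0, "G": -2.0, "ATG": -2.5, "CG": -1.0}`, the sequence `"ATGCG"` segments as `["ATG", "CG"]` (score −3.5) rather than `["AT", "G", "CG"]` (score −4.0).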
Multi Class-based n-gram Language Model for New Words Using Web Data
Out-of-vocabulary (OOV) words cause serious problems for automatic speech recognition (ASR) systems. Not only will an OOV word be misrecognized as an in-vocabulary word with similar phonetics, but the error will also cause errors in nearby words. Language models (LMs) for most open-vocabulary ASR systems treat OOV words as one entity, ignoring the linguistic information. In this paper we pres...
Publication date: 2010